Introduction

blablabla

Dataset description

The datasets used in the analysis below were sourced from www.kaggle.com[^1]. They were compiled from several sources, including the Bureau of Justice Statistics[^2] and the FBI Uniform Crime Reporting (UCR) Program[^3]. The National Prisoner Statistics Program, conducted by the Bureau of Justice Statistics, has collected data on the number of prisoners in state and federal prison facilities since 1926. The data are produced annually at the national and state levels and are sourced from the 50 state departments of correction, the Federal Bureau of Prisons and, until 2001, the District of Columbia. The UCR Program provides statistics on violent crime (murder and nonnegligent manslaughter, rape, robbery, and aggravated assault) and property crime (burglary, larceny-theft, and vehicle theft). These data are collected annually and are available at the national, state, and city levels. For our analysis we use the state-level statistics.

Additionally, we collected data on prison expenditures for each state from the Bureau of Justice Statistics[^4]. The figures are for 2016, the latest year available. Later in the analysis we use them to examine how expenditures correlate with the occurrence of particular crimes.

library(readr)  # read_csv(), read_delim()

prison <- read_csv("data/prison_custody_by_state.csv")
ucr <- read_csv("data/ucr_by_state.csv")
# the expenditures file is semicolon-delimited
prison_exp_2016 <- read_delim("data/prison_expenditures.csv", delim = ";")

Data preprocessing

Dataframe shape and missing values

Prison dataset

Unlike ucr, the prison data are in wide format, with one column per year. Using long_panel we reshaped the data frame to long format, so that each row is one jurisdiction in one year.

colnames(prison)[3:18] <- paste0(colnames(prison)[3:18],'1')
prison_panel <- long_panel(prison, begin = 2001, end = 2016, label_location = "beginning", id = "jurisdiction")
names(prison_panel)[names(prison_panel) == "wave"] <- "year"
names(prison_panel)[names(prison_panel) == "1"] <- "prison"
kable(head(prison_panel))
| jurisdiction | year | includes_jails | prison |
|---|---|---|---|
| Alabama | 2001 | 0 | 24741 |
| Alabama | 2002 | 0 | 25100 |
| Alabama | 2003 | 0 | 27614 |
| Alabama | 2004 | 0 | 25635 |
| Alabama | 2005 | 0 | 24315 |
| Alabama | 2006 | 0 | 24103 |
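The same wide-to-long reshape can be sketched in base R, which may help when checking what long_panel returns. This is only an illustration: the values are made up, and only the column names jurisdiction, year and prison mirror the ones used above.

```r
# Toy wide table with one column per year (made-up values, not the real data)
wide <- data.frame(
  jurisdiction = c("A", "B"),
  `2001` = c(10, 20),
  `2002` = c(11, 21),
  check.names = FALSE
)

# Every column except the identifier is treated as a year
year_cols <- setdiff(names(wide), "jurisdiction")

# Stack the year columns: unlist() goes column by column,
# so jurisdictions repeat as a block and each year repeats nrow(wide) times
long <- data.frame(
  jurisdiction = rep(wide$jurisdiction, times = length(year_cols)),
  year = rep(as.integer(year_cols), each = nrow(wide)),
  prison = unlist(wide[year_cols], use.names = FALSE)
)

# Sort to one block of years per jurisdiction, as in the table above
long <- long[order(long$jurisdiction, long$year), ]
rownames(long) <- NULL
```

Each row of long is then one jurisdiction-year pair, which is the shape needed to join the prison counts with ucr.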

UCR dataset

(…) description of the variables: what they are and what they mean

Unlike the other datasets, which have no missing values, the ucr dataset has many. We dropped the last 6 columns, which were completely empty, and then dropped the rows consisting only of missing values. After this, the only columns with missing values are “rape_revised” (612 missing) and “rape_legacy” (104 missing).

# removing last 6 columns
ucr <- ucr[, -c(16:21)]
# removing all missing rows
ind <- apply(ucr, 1, function(x) all(is.na(x)))
ucr <- ucr[ !ind, ]
# showing sum of missing values per columns
sapply(ucr, function(x) sum(is.na(x)))
##           jurisdiction                   year crime_reporting_change 
##                      0                      0                      0 
##       crimes_estimated       state_population    violent_crime_total 
##                      0                      0                      0 
##    murder_manslaughter            rape_legacy           rape_revised 
##                      0                    104                    612 
##                robbery            agg_assault   property_crime_total 
##                      0                      0                      0 
##               burglary                larceny          vehicle_theft 
##                      0                      0                      0
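A possible alternative to selecting the empty columns by position is to detect fully missing rows and columns from the data. The sketch below uses a made-up toy data frame; the column name empty_col is hypothetical.

```r
# Toy data frame with one fully empty row and one fully empty column (made-up values)
toy <- data.frame(
  jurisdiction = c("A", NA, "B"),
  burglary     = c(5, NA, 7),
  empty_col    = c(NA, NA, NA)
)

# Keep only columns that have at least one non-missing value
toy <- toy[, colSums(is.na(toy)) < nrow(toy), drop = FALSE]

# Keep only rows that have at least one non-missing value
toy <- toy[rowSums(is.na(toy)) < ncol(toy), , drop = FALSE]

# Number of missing values left in each column
na_counts <- colSums(is.na(toy))
```

Detecting the empty columns this way does not depend on them being the last 6 columns of the file.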

As the left-hand plot below shows, the last two years, 2016 and 2017, contain one additional observation, i.e. an additional jurisdiction. The right-hand plot shows that New York is missing in one year and that Puerto Rico appears in only 3 years. District of Columbia is sometimes recorded as DC, but the two spellings together cover all 17 years.

# number of observations (jurisdictions) per year
plot.data1 <- ucr %>% group_by(year) %>% count()
ggp1 <- ggplot(data = plot.data1, aes(x = year, y = n)) + geom_col() +
  theme(axis.title.x = element_blank())

# jurisdictions observed in fewer than 17 years
plot.data2 <- ucr %>% group_by(jurisdiction) %>% count() %>% arrange(n) %>% filter(n < 17)
ggp2 <- ggplot(data = plot.data2, aes(x = jurisdiction, y = n)) + geom_col() +
  theme(axis.title.x = element_blank())

grid.arrange(ggp1, ggp2, ncol = 2)

We renamed “DC” to “District of Columbia” and dropped the observations for “Puerto Rico”. We also (…) fill in the missing New York values, e.g. with the mean.

ucr$jurisdiction[ucr$jurisdiction=="DC"] <- "District of Columbia"
ucr <- ucr %>% filter(jurisdiction!="Puerto Rico")
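Filling in the missing New York year is not implemented above. One way to do it, assuming the mean of the jurisdiction's other years is an acceptable fill value, is sketched below on made-up data; the choice of the mean is only one option, not a settled decision.

```r
# Toy panel: jurisdiction "B" has no row for 2002 (made-up values)
toy <- data.frame(
  jurisdiction = c("A", "A", "A", "B", "B"),
  year         = c(2001, 2002, 2003, 2001, 2003),
  burglary     = c(10, 12, 14, 20, 30)
)

# Build every jurisdiction-year combination and left-join the observed data,
# so the missing observation becomes an explicit row with NA
full_grid <- expand.grid(
  jurisdiction = unique(toy$jurisdiction),
  year = unique(toy$year),
  stringsAsFactors = FALSE
)
filled <- merge(full_grid, toy, by = c("jurisdiction", "year"), all.x = TRUE)

# Mean of the observed years, computed separately for each jurisdiction
juris_mean <- ave(filled$burglary, filled$jurisdiction,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing values with that mean
filled$burglary <- ifelse(is.na(filled$burglary), juris_mean, filled$burglary)
```

If the series has a clear trend, interpolating between the neighbouring years may be a better choice than the overall mean.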

The remaining missing columns (…) whether we drop them, keep them, or do something with them later; what these variables actually are

rape_df <- data.frame(year=2001:2017)
rape_revised_count <- ucr[!is.na(ucr$rape_revised),] %>% 
                            group_by(year) %>% 
                            count(name="rape_revised_count")
rape_legacy_count <- ucr[!is.na(ucr$rape_legacy),] %>% 
                            group_by(year) %>% 
                            count(name="rape_legacy_count")
rape_df <- left_join(rape_df, rape_revised_count, by="year")
rape_df <- left_join(rape_df, rape_legacy_count, by="year")

kable(rape_df)
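| year | rape_revised_count | rape_legacy_count |
|---|---|---|
| 2001 | NA | 51 |
| 2002 | NA | 51 |
| 2003 | NA | 51 |
| 2004 | NA | 51 |
| 2005 | NA | 51 |
| 2006 | NA | 51 |
| 2007 | NA | 51 |
| 2008 | NA | 51 |
| 2009 | NA | 51 |
| 2010 | NA | 51 |
| 2011 | NA | 51 |
| 2012 | NA | 51 |
| 2013 | 51 | 51 |
| 2014 | 51 | 51 |
| 2015 | 50 | 50 |
| 2016 | 51 | NA |
| 2017 | 51 | NA |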

Unifying state names

In the prison dataset, District of Columbia is named Federal, and in prison_exp_2016 it is named Washington, D.C., so to unify the names we renamed both to District of Columbia. We also renamed the variable State and type of government to jurisdiction to simplify later calculations.

setdiff(prison$jurisdiction %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Federal"
setdiff(prison_exp_2016$`State and type of government` %>% unique(), ucr$jurisdiction %>% unique())
## [1] "Washington, D.C."
prison$jurisdiction[prison$jurisdiction=="Federal"] <- "District of Columbia"
names(prison_exp_2016)[names(prison_exp_2016) == "State and type of government"] <- "jurisdiction"
prison_exp_2016$jurisdiction[prison_exp_2016$jurisdiction=="Washington, D.C."] <- "District of Columbia"
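When more than one spelling needs fixing, a named lookup vector keeps the renaming in one place, and running setdiff in both directions confirms that nothing is left unmatched. The sketch below uses made-up name vectors; lookup, reference and other are hypothetical names.

```r
# Made-up name vectors standing in for two jurisdiction columns
reference <- c("Alabama", "District of Columbia")
other     <- c("Alabama", "Washington, D.C.")

# Hypothetical lookup table: old spelling -> unified spelling
lookup <- c(
  "Washington, D.C." = "District of Columbia",
  "DC"               = "District of Columbia"
)

# Recode only the names that appear in the lookup and leave the rest unchanged
hit <- other %in% names(lookup)
other[hit] <- unname(lookup[other[hit]])

# After recoding, both differences should be empty
only_in_other     <- setdiff(other, reference)
only_in_reference <- setdiff(reference, other)
```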

Background

According to recent surveys of United States expenditures, spending on incarceration has increased about three times as fast as spending on elementary and secondary education during this time period. (…)

Statistical analysis of the dataset

What is the relationship between the number of prisoners (prison) and the occurrence of particular crimes over the years (ucr)? Does an increase in the number of prisoners reduce the rate of any type of crime? Or is there a steady increase or decrease in crime? (geom line and geom smooth)

Does this significant investment in imprisonment improve public safety? Prison expenditures versus the occurrence of crimes, overall and by category, in 2016 (the latest data); source: https://www.bjs.gov/index.cfm?ty=dcdetail&iid=286
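One possible way to approach the expenditure question is to join the two tables by jurisdiction, convert both quantities to per-capita terms, and compute a rank correlation. The sketch below uses made-up numbers; the column name expenditure is hypothetical, and the Spearman method is an assumption rather than a choice made in this analysis.

```r
# Made-up state-level figures for a single year
crime <- data.frame(
  jurisdiction        = c("A", "B", "C", "D"),
  state_population    = c(1e6, 2e6, 4e6, 5e5),
  violent_crime_total = c(4000, 6000, 20000, 1000)
)
spending <- data.frame(
  jurisdiction = c("A", "B", "C", "D"),
  expenditure  = c(3e8, 5e8, 1.6e9, 1e8)  # hypothetical column name
)

# Join the two tables on the unified jurisdiction name
both <- merge(crime, spending, by = "jurisdiction")

# Violent crimes per 100,000 inhabitants and spending per inhabitant,
# so that large and small states are comparable
both$violent_rate   <- both$violent_crime_total / both$state_population * 1e5
both$exp_per_capita <- both$expenditure / both$state_population

# Rank correlation between spending and the crime rate
rho <- cor(both$exp_per_capita, both$violent_rate, method = "spearman")
```

A correlation computed this way for a single year describes association only and says nothing about whether spending causes changes in crime.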

How does the number of prisoners change over the years, for the whole country and for individual states?

# national total per year (columns 3:18 hold the yearly counts)
prison_country <- prison[,c(3:18)]
prison_country <- sapply(prison_country, sum)
df <- stack(prison_country)
colnames(df) <- c("value", "year")
# the year columns were suffixed with "1" for long_panel above,
# so keep only the 4-digit year for the axis and tooltip
df$year <- as.integer(substr(as.character(df$year), 1, 4))
p <- ggplot(data = df, aes(x = year, y = value, group = 1, 
            text = paste("Year: ", year,
                         "<br>Number of prisoners:", value))) +
  geom_line() + 
  geom_point() + 
  # scale_color_viridis() + 
  # scale_fill_viridis() +
  labs(title = "Number of prisoners in the USA by year", x = "Year", y = "Number of prisoners") +
  theme_minimal()

ggplotly(p, tooltip = "text")

additional variables -> area (ok) - in the code -> prison expenditures (ok) - in Excel, 2016

-> what besides the map and the bubbles? - heatmap -

-> https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/